The Annals of Applied Statistics 
2010, Vol. 4, No. 4, 1698-1721 
DOI: 10.1214/10-AOAS345 

© Institute of Mathematical Statistics. 2010 



MULTICATEGORY VERTEX DISCRIMINANT ANALYSIS FOR 
HIGH-DIMENSIONAL DATA 

By Tong Tong Wu 1 and Kenneth Lange 2 
University of Maryland and University of California 

In response to the challenges of data mining, discriminant anal- 
ysis continues to evolve as a vital branch of statistics. Our recently 
introduced method of vertex discriminant analysis (VDA) is ideally 
suited to handle multiple categories and an excess of predictors over 
training cases. The current paper explores an elaboration of VDA 
that conducts classification and variable selection simultaneously. 
Adding lasso (£i-norm) and Euclidean penalties to the VDA loss 
function eliminates unnecessary predictors. Lasso penalties apply to 
each predictor coefficient separately; Euclidean penalties group the 
collective coefficients of a single predictor. With these penalties in 
place, cyclic coordinate descent accelerates estimation of all coeffi- 
cients. Our tests on simulated and benchmark real data demonstrate 
the virtues of penalized VDA in model building and prediction in 
high-dimensional settings. 

1. Introduction. Despite its long history, discriminant analysis is still 
undergoing active development. Four forces have pushed classical methods 
to the limits of their applicability: (a) the sheer scale of modern datasets, 
(b) the prevalence of multicategory problems, (c) the excess of predictors 
over cases, and (d) the exceptional speed and memory capacity of mod- 
ern computers. Computer innovations both solve and drive the agenda of 
data mining. What was unthinkable before has suddenly become the fo- 
cus of considerable mental energy. The theory of support vector machines 
(SVM) is largely a response to the challenges of binary classification. SVMs 
implement a geometric strategy of separating two classes by an optimal hy- 
perplane. This simple paradigm breaks down in passing from two classes to 
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multiple classes. The one-versus-rest (OVR) remedy reduces classification 
with k categories to binary classification. Unfortunately, OVR can perform 
poorly when no dominating class exists [Lee, Lin and Wahba (2004)]. The 



value, but it constitutes an even more egregious violation of the criterion of 
parsimony. In the opinion of many statisticians, simultaneous classification 
is more satisfying theoretically and practically. This attitude has prompted 
the application of hinge loss functions in multicategory SVM [Bredensteiner 
and Bennett (1999); Crammer and Singer (2001); Guermeur (2002); Lee, 
Lin and Wahba (2004); Liu, Shen and Doss (2005, 2006); Liu (2007); Vap- 
nik (1998); Weston and Watkins (1999); Zhang (2004b); Zou, Zhu and Hastie 
(2006); Yuan, Joseph and Zou (2009)]. 

Our earlier paper [Lange and Wu (2008)] introduced a new method of mul- 
ticategory discriminant analysis that shares many of the attractive proper- 
ties of multicategory SVM under hinge loss. These properties include simple 
linear prediction of class vertices, creation of dead regions where predictions 
incur no loss, and robustness to outliers. Our vertex discriminant analy- 
sis (VDA) procedure has the advantage of operating in (k — l)-dimensional 
space rather than in /c-dimensional space. Each class is represented by a 
vertex of a regular simplex, with the vertices symmetrically arranged on the 
surface of the unit ball in These conventions emphasize symmetry, 

eliminate excess parameters and constraints, and simplify computation and 
model interpretation. For hinge loss we substitute e-insensitive loss. Both 
loss functions penalize errant predictions; the difference is that hinge loss 
imposes no penalty when wild predictions fall on the correct side of their 
class indicators. A generous value of e makes e-insensitive loss look very 
much like hinge loss. In addition, e-insensitive loss enjoys a computational 
advantage over hinge loss in avoiding constraints. This makes it possible to 
implement rapid coordinate descent. Class assignment in VDA is defined by 
a sequence of conical regions anchored at the origin and surrounding the 
class vertices. 

Modern methods of discriminant analysis such as VDA are oriented to 
data sets where the number of predictors p is comparable to or larger than 
the number of cases n. In such settings it is prudent to add penalties that 
shrink parameter estimates to 0. Our paper [Lange and Wu (2008)] imposes 
a ridge penalty to avoid overfitting. Although shrinkage forces a predicted 
point toward the origin, the point tends to stay within its original conical 
region. Hence, no correction for parameter shrinkage is needed, and the perils 
of underprediction are mitigated. A ridge penalty also adapts well to an MM 
algorithm for optimizing the loss function plus penalty. 

Motivated by problems such as cancer subtype classification, where the 
number of predictors p far exceeds the number of observations n, we resume 




[Kressel (1999)] has 



MULTICATEGORY VDA FOR HIGH-DIMENSIONAL DATA 



3 



our study of VDA in the current paper. In this setting conventional methods 
of discriminant analysis prescreen predictors [Dudoit, Fridlyand and Speed 
(2002); Li, Zhang and Ogihara (2004); Li, Zhang and Jiang (2005);] before 
committing to a full analysis. Wang and Shen (2007) argue against this 
arbitrary step of univariate feature selection and advocate imposing a lasso 
penalty. Ridge penalties are incapable of feature selection, but lasso penalties 
encourage sparse solutions. Consumers of statistics are naturally delighted 
to see classification reduced to a handful of predictors. In our experience, it 
is worth adding further penalties to the loss function. Zhang et al. (2008) 
suggest £oo penalties that tie together the regression coefficients pertinent 
to a single predictor. In our setting Euclidean penalties achieve the same 
goal and preserve spherical symmetry. By design l\ penalties and Euclidean 
penalties play different roles in variable selection. One enforces sparsity of 
individual variables, while the other enforces sparsity of grouped variables. 
In the sequel we denote our original VDA with a ridge penalty as VDAr, 
the modified VDA with a lasso penalty as VDAl, the modified VDA with 
a Euclidean penalty as VDAe, and the modified VDA with both lasso and 
Euclidean penalties as VDAle- The same subscripts will be attached to the 
corresponding penalty tuning constants. 

A second objection to VDAr as it currently stands is that the compu- 
tational complexity of the underlying MM algorithm scales as 0(p 3 ). This 
computational hurdle renders high-dimensional problems intractable. Al- 
though substitution of lasso penalties for ridge penalties tends to complicate 
optimization of the objective function, prior experience with lasso penalized 
regression [Friedman et al. (2007); Wu and Lange (2008)] suggests updat- 
ing one parameter at a time. We implement cyclic coordinate descent by 
repeated application of Newton's method in one dimension. In updating a 
single parameter by Newton's method, one can confine attention to the in- 
tervals to the left or right of the origin and ignore the kink in the lasso. The 
kinks in e-insensitive loss are another matter. We overcome this annoyance 
by slightly smoothing the loss function. This maneuver preserves the rele- 
vant properties of e-insensitive loss and leads to fast reliable optimization 
that can handle thousands of predictors with ease. Once the strength of the 
lasso penalty is determined by cross-validation, model selection is complete. 

In practice, cross-validation often yields too many false positive predic- 
tors. This tendency has prompted Meinshausen and Buehlmann (2010) to 
introduce the method of stability selection, which requires selected predic- 
tors to be consistently selected across random subsets of the data. Here we 
demonstrate the value of stability selection in discriminant analysis. Because 
our revised versions of VDA are fast, the 100-fold increase in computing de- 
manded by stability selection is manageable. 

Before summarizing the remaining sections of this paper, let us mention 
its major innovations: (a) the new version of VDA conducts classification 
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and variable selection simultaneously, while the original VDA simply ignores 
variable selection; (b) coordinate descent is substituted for the much slower 
MM algorithm, (c) e-insensitive loss is approximated by a smooth loss to ac- 
commodate Newton's method in coordinate descent, (d) Fisher consistency 
is established, (e) a grouped penalty is added, and (f ) the new VDA is tested 
on a fairly broad range of problems. These changes enhance the conceptual 
coherence, speed, and reliability of VDA. 

The rest of the paper is organized as follows. Section 2 precisely formu- 
lates the VDA model, reviews our previous work on VDAr, and derives cyclic 
coordinate descent updates for VDAl- Section 3 introduces the Euclidean 
penalty for grouped predictors and sketches the necessary modifications of 
cyclic coordinate descent. Section 4 takes a theoretical detour and shows that 
e-insensitive loss is Fisher consistent. By definition, Fisher consistent classi- 
fiers satisfy the Bayes optimal decision rule. There is already a considerable 
literature extending previous proofs of Fisher consistency from binary clas- 
sification to multicategory classification [Lee, Lin and Wahba (2004); Liu, 
Shen and Doss (2006); Liu (2007); Wang and Shen (2007); Zhang (2004a); 
Zou, Zhu and Hastie (2008)]. Section 5 quickly reviews the basics of stability 
selection. Sections 6 and 7 report our numerical tests of VDA on simulated 
and real data. Section 8 concludes with a brief summary and suggestions for 
further research. 

2. Modified vertex discriminant analysis. 

2.1. Ridge penalized vertex discriminant analysis (VDAr). Vertex dis- 
criminant analysis (VDA) is a novel method of multicategory supervised 
learning [Lange and Wu (2008)]. It discriminates among categories by mini- 
mizing e-insensitive loss plus a penalty. For reasons of symmetry, the vertices 
corresponding to the different classes are taken to be equidistant. With two 
categories, the points —1 and 1 on the real line suffice for discrimination. 
With three categories there is no way of choosing three equidistant points 
on the line. Therefore, we pass to the plane and choose the vertices of an 
equilateral triangle. In general with k > 3 categories, we choose the vertices 
vi,. . . ,Vk of a regular simplex in IR^" 1 . Among the many ways of construct- 
ing a regular simplex, we prefer the simple definition 
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and ej is the jth coordinate vector in R fc-1 . This puts the vertices on the 
surface of the unit ball in It is impossible to situate more than k 

equidistant points in R fc_1 . 

Suppose Y and X denote the class indicator and feature vector of a ran- 
dom case. The vector Y coincides with one of the vertices of the simplex. 
Given a loss function L(y,x), discriminant analysis seeks to minimize the 
expected loss 

E[L(Y,X)]=E{E[L(Y,X)\X]}. 

This is achieved empirically by minimizing the average conditional loss n _1 x 
^" =1 L(yj, Xj). To maintain parsimony, VDA postulates the linear regression 
model y = Ax + b, where A = (aji) is a (/c — 1) xp matrix of slopes and b = (bj) 
is a k — 1 column vector of intercepts. Overfitting is avoided by imposing 
penalties on the slopes aji but not on the intercepts bj. In VDA we take the 
loss function for case i to be g(yi — Ax, — b), where g(z) is the e-insensitive 
Euclidean distance 

g(z) = \\z\\2,e = max{||z|| 2 - e,0}. 
Classification proceeds by minimizing the objective function 

1 n 

(2.2) f{9) = - YVj/i - Ax, -b) + XP(A), 

n i 

i=i 

where 9 = (A,b), and P{A) is the penalty on the matrix of slopes A. Since 
the loss function is convex, it is clearly advantageous to take P(A) to be 
convex as well. In VDAr the ridge penalty P (A) = ^ ■ ^ is employed. 
Because of its near strict convexity, the objective function f(9) usually has 
a unique minimum. Once A and b are estimated, we can assign a new case 
to the closest vertex, and hence category. 

For prediction purposes, VDAr is competitive in statistical accuracy and 
computational speed with the best available algorithms for discriminant 
analysis [Lange and Wu (2008)]. Unfortunately, it suffers two limitations. 
First, although it shrinks estimates toward 0, it is incapable of model se- 
lection unless one imposes an arbitrary cutoff on parameter magnitudes. 
Second, its computational complexity scales as 0(p 5 ) for p predictors. This 
barrier puts problems with ten of thousands of predictors beyond its reach. 
Modern genomics problems involve hundreds of thousands to millions of 
predictors. The twin predicaments of model selection and computational 
complexity have prompted us to redesign VDA with different penalties and 
a different optimization algorithm. 
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2.2. A toy example for vertex discriminant analysis. The use of e-insen- 
sitive loss is based on the assumption that it makes little difference how 
close a linear predictor is to its class indicator when an observation is cor- 
rectly classified. Here e is the radius of the circle/ball around each vertex. 
Training observations on the boundary or exterior of the e-insensitive balls 
act as support vectors and exhibit sensitivity. Observations falling within 
an e-insensitive ball exhibit insensitivity and do not directly contribute to 
the estimation of regression coefficients. The definition of e-insensitive loss 
through Euclidean distance rather than squared Euclidean distance makes 
classification more resistant to outliers. The following small simulation ex- 
ample demonstrates the importance of creating the dead zones where obser- 
vations receive a loss of 0. These zones render estimation and classification 
highly nonlinear. 

We generated 300 training observations equally distributed over k = 3 
classes. To each observation i we attached a normally distributed predictor 
Xi with variance 1 and mean 

{—4, class = 1, 
0, class = 2, 
4, class = 3. 

We then compared four methods: (a) least squares with class indicators 
Vj equated to the standard unit vectors ej in R 3 (indicator regression); (b) 
least squares with class indicators Vj equated to the vertices of an equilateral 
triangle inscribed on the unit circle as described in (2.1); (c) e-insensitive 
loss with the triangular vertices and e = 0.6; and (d) e-insensitive loss with 
the triangular vertices and e = 1/2-^/2/c/ \k — 1) = 0.866. Because there is 
only a single predictor, all four methods omit penalization. As advocated in 
Lange and Wu (2008), method (d) adopts the maximum value of e consistent 
with nonoverlapping balls. 

Figure 1 plots the three distances Xi — > \\y~i — v j\\ between the predicted 
value iji for observation i and each of the three vertices Vj. An observa- 
tion is assigned to the class whose vertex is closest. It is evident from these 
plots that squared Euclidean loss fails to identify class 2, which is dom- 
inated and masked by the other two classes (upper two panels of Figure 
1). With surrounding balls of small radius, class 2 can be identified but 
the misclassification rate is high (13%, lower left plot). With surrounding 
balls of the maximum legal radius, e-insensitive loss readily distinguishes 
all three classes with a low misclassification rate (2.67%, lower right plot). 
This example nicely illustrates the importance of the dead zones integral to 
e-insensitive loss. Our previous paper [Lange and Wu (2008)] reaches essen- 
tially the same conclusions by posing discrimination with three classes as a 
problem in one-dimensional regression. Section 4 discusses how e-insensitive 
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Squared Euclidean Loss with ej 




-6 -4 -2 2 4 6 
X 

e - insensitive Loss with e = 0.6 




Squared Euclidean Loss with Vertices 




-6 -4 -2 2 4 6 
X 

e - insensitive Loss with e = 0.866 




Fig. 1. Distance to class indicators. The upper left plot shows observed distances un- 
der squared Euclidean loss with class indicators Vj equated to the standard unit vectors in 
R 3 . The upper right plot shows observed distances under squared Euclidean loss with class 
indicators equated to the vertices of an equilateral triangle. The lower left plot shows ob- 
served distances under e-insensitive loss with the triangular vertices and e = 0.6. Finally, 
the lower right plot shows observed distances under e-insensitive loss with the triangular 
vertices and e = l/2^/2fc/(fc — 1) = 0.866. In the lower right plot, it is clear that obser- 
vations with x < —2 will be predicted as class 1 (black), observations with x > 2 will be 
predicted as class 3 (green), and observations with —2 < x < 2 will be predicted as class 2 
(red). This is consistent with the true classes shown on the x-axis. 

loss achieves Fisher consistency. The dead zones figure prominently in the 
derivation of consistency. 

In these four examples masking is neutralized. Because our proof of Fisher 
consistency requires nonlinear as well as linear functions, the possibility 
of masking still exists in practice. Inclusion of nonlinear combinations of 
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predictors, say products of predictors, may remedy the situation. Of course, 
creating extra predictors highlights the need for rigorous model selection 
and fast computation. 

2.3. Modified e-insensitive loss. The kinks in e-insensitive loss have the 
potential to make Newton's method behave erratically in cyclic coordinate 
descent. It is possible to avoid this pitfall by substituting a similar loss 
function that is smoother and still preserves convexity. Suppose f(s) is an 
increasing convex function defined on [0,oo). If ||x|| denotes the Euclidean 
norm of x, then the function /(||x||) is convex. This fact follows from the 
inequalities 

f[\\ax + (l-a)y\\]<f[a\\x\\ + (l-a)\\y\\] 

<a/(|H|) + (l-a)/(||y||) 

for a € [0,1]. It seems reasonable to perturb the e-insensitive function as 
little as possible. This suggests eliminating the corner near s = e. Thus, we 
define f(s) to be on the interval [0, e — 8), a polynomial on the interval 
[e — 5,e + 5], and s — e on the interval (e + 5, oo) . Here we obviously require 
0<6<e. 

There are two good candidate polynomials. The first is the quadratic 

(s-e + S) 2 

P2{S) = 45 • 

This function matches the function values and the first derivatives of the 
two linear pieces at the join points e — 5 and e + 6. Indeed, brief calculations 
show that 

p 2 (e-5) = 0, P ' 2 (e-5) = 0, p 2 (e + 5) = 5, p' 2 (e + S) = l. 

Unfortunately, the second derivative p 2 (s) = (2<5) -1 does not match the van- 
ishing second derivatives of the two linear pieces at the join points. Clearly, 
P2(s) is increasing and convex on the open interval (e — 6, s + S). 
A more complicated choice is the quartic polynomial 

, . (s-e + 5) 3 (35-s + e) 

= 1^3 • 

Now we have 

p 4 (e - S) = 0, p' 4 {e - 5) = 0, p£(e - 8) = 0, 
P4 (e + 5)=5, p' 4 (e + 8) = l, j#(e + 5) = 0. 

Both the first and second derivatives 
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are positive throughout the open interval (e — 6, e + S) . The second derivative 
attains its maximum value of J? at the midpoint e. Thus, Pi(s) is increasing 
and convex on the same interval. We now write p{s) for the function equal 
to on [0, e — 5), to ^4(5) on [e — <5,e + S], and to s — e on (e + 5, 00). 

2.4. Cyclic coordinate descent. In our modified VDA the alternative loss 
function p(\\yi — Axi — b\\) is twice continuously differentiable. This has the 
advantage of allowing us to implement Newton's method. If we abbreviate 
Hi — Axi — b by r j , then applying the chain rule repeatedly yields the partial 
derivatives 



The only vector operation required to form these partial derivatives is com- 
putation of the norm ||rj||. As long as the number of categories is small and 
we update the residuals ri as we go, the norms are quick to compute. 
Our overall objective function f{9) is given in (2.2) with 



replacing the e-insensitive loss g(v) = ||z||2,e throughout. To minimize this 
objective function in the presence of a large number of predictors, we use the 
cyclic version of coordinate descent highlighted by Friedman et al. (2007) 
and Wu and Lange (2008). Cyclic coordinate descent avoids the bottlenecks 
of ordinary regression, namely matrix diagonalization, matrix inversion, and 
the solution of large systems of linear equations. It is usually fast and always 
numerically stable for smooth convex objective functions. 

Consider now the convex lasso penalty P{A) = YljZi Ta=i \ a ji\- Although 
the objective function f(6) is nondifferentiable, it possesses forward and 
backward directional derivatives along each coordinate direction. If eji is the 
coordinate direction along which aji varies, then the forward and backward 








if \\v\\2 > e + 5, 

if ||f||2 e [e — 5,e + 5] 
if ||t>||2 < e — 5, 
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directional derivatives are 



de 3 J{0) 



lim 

t\0 



f(e + re 3l )-f(9) 



T 



1 



n 



d 
daji 



g(ri) + (-1)"°' ,<0) A 



n 



i=l 



and 



lim 

Ti-0 



T 




where /(•) is an indicator function taking value 1 if the condition in the 
parentheses is satisfied and otherwise. 

Newton's method for updating a single intercept parameter of f(6) works 
well because there is no lasso penalty. For a slope parameter aji, the lasso 
penalty intervenes, and we must take particular care at the origin. If both of 
the directional derivatives d ejl f(8) and d- ejl f(6) are nonnegative, then the 
origin furnishes the minimum of the objective function along the eji coordi- 
nate. If either directional derivative is negative, then we must solve for the 
minimum in the corresponding direction. Both directional derivatives cannot 
be negative because this contradicts the convexity of f(0). In practice, we 
start all parameters at the origin. For underdetermined problems with just 
a few relevant predictors, most updates are skipped, and many parameters 
never budge from their starting values of 0. This simple fact plus the com- 
plete absence of matrix operations explains the speed of cyclic coordinate 
descent. It inherits its numerical stability from the descent property of each 
update. 

Newton's method for updating aji iterates according to 



where r™ is the value of the ith residual at iteration m. In general, one should 
check that the objective function is driven downhill. If the descent property 
fails, then the simple remedy of step halving is available. The Newton update 
for an intercept is 



a 






(Vn)E 



n d 2 



9{r?) 
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3. Penalty for grouped effects. 

3.1. Euclidean penalty. In model selection it is often desirable to im- 
pose coordinated penalties that include or exclude all of the parameters in 
a group. In multicategory classification, the slopes of a single predictor for 
different dimensions of M. k ~~ 1 form a natural group. In other words, the pa- 
rameter group for predictor I is the lib. column a/ = (an, . . . , ajs-ij)* of the 
slope matrix A. The lasso penalty Ai||aj||i and the ridge penalty Aj?||a/||| 
separate parameters and do not qualify as sensible group penalties. The 
scaled Euclidean norm A^||oj||2 is an ideal group penalty since it couples 
parameters and preserves convexity [Wu and Lange (2008); Wu, Zou and 
Yuan (2008)]. 

The Euclidean penalty possesses several other desirable features. First, it 
reduces to a lasso penalty X\aji\ on aji whenever a m i = for m^=j. This 
feature of the penalty enforces parsimony in model selection. Second, the 
Euclidean penalty is continuously differentiable in ai whenever a/ is nontriv- 
ial. Third, the Euclidean penalty is spherically symmetric. This makes the 
specific orientation of the simplex irrelevant. If one applies an orthogonal 
transformation O to the simplex, then the transformed vertices Oy are still 
equidistant. Furthermore, the new and old versions of the objective functions 
satisfy 



p 

1 '\ a l\\2 



- E 9 ^ Vi ~ Axi - b) + \ E ^T\ 
n i=i i=i 

n p 

= ~ E 9(Oyi ~ OAxi - Ob) + X E £ \\O ai || . 
i=i i=i 

Thus, any minimum of one orientation is easily transformed into a minimum 
of the other, and predictors active under one orientation are active under 
the other orientation. For instance, if the estimates for the original objec- 
tive function are A and b, then the estimates for the transformed objective 
function are OA and Ob. 

3.2. Coordinate descent under a Euclidean penalty. In modified VDA 
with grouped effects, we minimize the objective function 

n k—l p p 

(3.i) f(8) = Y,9(vi -Axi-b)+\ L j2j2 M + Xe E ii°*H 2 - 

i=l j=l 1=1 1=1 

If the tuning parameter for the Euclidean penalty A^ = 0, then the penalty 
reduces to the lasso. On the other hand, when the tuning parameter for the 
lasso penalty Al = 0, only group penalties enter the picture. Mixed penalties 
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with Al > and Xe > enforce shrinkage in both ways. All mixed penalties 
are norms on A and therefore convex functions. 

The partial derivatives of the Euclidean penalty are similar to those of 
the loss function g(v). There are two cases to consider. If ||a;|| = 0, then the 
forward and backward derivatives of A_e 1 1 a/ 1 1 with respect to aji are both Xe- 
The forward and backward second derivatives vanish. If \\ai\\ > 0, then ||a;|| 
is differentiable and 

5 \ ii ii \ a i l 
-AE\\ai\\ = ae-. 



dciji \\ai\\ 
Q2 x II II X e 4 

° a n \\ a i\\ V r IHn 



4. Fisher consistency of e-insensitive loss. A loss function L(y,x) is 
Fisher consistent if minimizing its average risk ~E{L[f(X), Y)]} leads to the 
Bayes optimal decision rule. Fisher consistency is about the least one can 
ask of a loss function. Our previous development of VDA omits this crucial 
topic, so we take it up now for e-insensitive loss without a penalty. The 
empirical loss minimized in VDA is 

-i n 1 n 

EML n (L, /) = -TL[f( Xl ), yi ] = - V \\ yi - f( Xi )\\ £ , 
i=i i=l 

and the VDA classifier is obtained by solving 

/ = arg min EML n (L, /), 

where J- n is the space of linear functions in the predictor matrix (xij). This 
space is determined by the slope matrix A and the intercept vector b. Once 
these are estimated, we assign a new case to the class attaining 

(4.1) argmin \\vj — f\\ = argmin \\vj — Ax — b\\. 

je{i,...,k} je{i,...,k} 

If we define Pj(x) = Pr(y = Vj\X = x) to be the conditional probability of 
class j given feature vector x, then Fisher consistency demands that the 
minimizer f*(x) = argminE[||V — /(X)|| E | X = x] satisfy 

argmin \\vj — = argmaxpj(x). 

je{i,...,k} je{i,...,k} 

Here distance is ordinary Euclidean distance, and f*(x) is not constrained 
to be linear in x. In the Supplementary File [Wu and Lange (2010)] we prove 
the following proposition. 
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Fig. 2. Contour plots of X^ Pjll^j — z \\z with k = 3 and e = \y/k/{k — 1) for different 
p's. Upper left panel p — (1/3,1/3,1/3) , upper right panel p = (0.37, 0.37, 0.26), lower left 
panel p — (0.6, 0.3, 0.1), and lower right panel p = (1/3 + t, 1/3 — 0.25f , 1/3 — 0.75t) with 
t = 0.025. 



Pro position 1 . If a minimizer f*(x) of E[\\Y — f(X)\\ e \ X = x] with 
e = \\j2kj(k — 1) lies closest to vertex v\, then pi(x) = max.jPj(x). Either 
f*(x) occurs exterior to all of the e-insensitive balls or on the boundary of 
the ball surrounding V[. The assigned vertex vi is unique if the Pj(x) are 
distinct. 

To help the reader better understand the behavior of the nonlinear func- 
tion z i— > ^2jPj\\vj — z we plot it and its contour lines in Figure 2 for k = 3 
classes. The three class vertices are labeled clockwise starting with vertex 1 
in the first quadrant. Here we take e = ^y / 2k/(k — 1) to be the largest pos- 
sible value avoiding overlap of the interiors of the e-insensitive balls around 
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each vertex of the regular simplex. Figure 2 demonstrates that the optimal 
point varies with the probability vector p. When the highest probabilities 
are not unique (upper two panels of Figure 2), the optimal point falls sym- 
metrically between the competing vertices. When there is a dominant class 
(lower left panel of Figure 2), the optimal point falls on the boundary of the 
dominant ball. In the first case with p = (1/3, 1/3, 1/3), if we slowly increase 
pi and decrease p2 and P3 symmetrically, then the optimal point moves from 
the origin to the boundary of the ball surrounding vertex 1 (lower right of 
Figure 2). 

5. Stability selection. Stability selection [Meinshausen and Buehlmann 
(2010)] involves subsampling the data and keeping a tally of how often a 
given variable is selected. Each new subsample represents a random choice 
of half of the existing cases. Let Il£ be the empirical probability over the 
subsamples that variable k is selected under a particular value A of the 
penalty tuning constant; the universe of relevant tuning constants is denoted 
by A. Meinshausen and Buehlmann (2010) recommend 100 subsamples; the 
choice of A is left to the discretion of the user. A predictor k is considered 
pertinent whenever max^^f^ > ir for some fixed threshold 7r > ^. The 
set of pertinent predictors g stable is the final (or stable) set of predictors 
determined by this criterion. 

One of the appealing features of stability selection is that it controls for the 
number of false positives. Under certain natural assumptions, Meinshausen 
and Buehlmann (2010) demonstrate that the expected number of false pos- 
itives among the stable set is bounded above by the constant q 2 /[{2-k — l)p], 
where q is the average size of the random union = IJagA ^> an< ^ ^ ^ s ^ ne 
set of predictors selected at the given penalty level A in the corresponding 
random subsample. 

6. Simulation examples. 

6.1. Simulation example 1. The simulation examples of Wang and Shen 
(2007) offer an opportunity to compare VDAle, VDAl, and VDAe to the 
highly effective methods OVR and L1MSVM [Wang and Shen (2007)]. Ex- 
ample 1 of Wang and Shen (2007) specifies either k = 4 and k = 8 classes, 
n = 20k training observations, and p = 100 predictors. When observation i 
belongs to class c £ {1, . . . , k}, its jth predictor defined by 



where a± = d ■ cos[2(c — l)7r/fc] and ai = d- sin[2(c — l)n/k]. The Uij are in- 
dependent normal variates with mean and variance 1. Only the first and 




if 3 = 1,2, 
if j = 3,... 



100, 
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second predictors x\ and xi depend on an observation's class. The constant 
d determines the degree of overlap of the classes. The various combinations 
of k € {4, 8} and d € {1,2,3} generate six datasets of varying difficulty. The 
three datasets with k = 4 are underdetermined since n = 80 < p = 100; the 
datasets with k = 8 are overdetermined since n = 160 > p = 100. As recom- 
mended in Lange and Wu (2008), we take e = \\j2kj(k — 1), the largest pos- 
sible value avoiding overlap of the interiors of the e-insensitive balls around 
the vertices of the regular simplex. For all five methods, we chose the penalty 
tuning constants by minimizing assignment error on a separate testing sam- 
ple with 20,000 observations. Table 1 reports Bayes errors, optimal testing 
errors, number of variables selected, and elapsed training time in seconds 
(xlO 4 ) averaged over 100 random replicates. 

Table 1 shows that VDAle and VDAe outperform VDAl across all six 
datasets. Testing error is closely tied to the number of predictors selected. 
The addition of a lasso penalty gives VDAle a slight edge over VDAe- The 
competing method L1MSVM performs best overall by a narrow margin. The 
true predictors x\ and X2 are always selected by all three VDA methods. In 
reporting their results for L1MSVM and OVR, Wang and Shen (2007) omit 
mentioning the number of true predictors selected and computing times. 

6.2. Simulation example 2. In the second example of Wang and Shen 
(2007), k = 3, n = 60, and p varies over the set {10, 20, 40, 80, 160}. We now 
compare the three modified VDA methods with L1MSVM [Wang and Shen 
(2007)] and L2MSVM [Lee, Lin and Wahba (2004)]. The 60 training cases 
are spread evenly across the three classes. For p equal to 10, 20, and 40, 
discriminant analysis is overdetermined; the reverse holds for p equal to 80 
and 160. The predictors independent normal deviates with variance 

1 and mean for j > 2. For j < 2, Xij have mean Oj with 



Only the first two predictors are relevant to classification. 

We computed VDA testing errors over an independent dataset with 30,000 
observations and chose penalty tuning constants to minimize testing error 
over a grid of values. Table 2 reports Bayes errors, optimal testing errors, 
number of variables selected, and training times in seconds (xlO 4 ) averaged 
across 100 random replicates. In this example VDAle , VDAe, and L1MSVM 
rank first, second, and third, respectively, in testing error. Again there is a 
strong correlation between testing error and number of predictors selected, 
and the lasso penalty is effective in combination with the Euclidean penalty. 
The true predictors X\ and X2 are always selected by the three VDA meth- 
ods. 




Table 1 

Comparison of VDAle, VDAl, VDAe, L1MSVM, and OVR on simulation example 1. Here p = 100 and n = 20k; d is the multiplier in 
each simulation. Column 3 lists the Bayes error as a percentage. Column 5 reports the mean and 10%, 50%, and 90% percentiles of the 
number of nonzero variables. The remaining columns report average testing error, average number of nonzero variables, and average 
training time in seconds (xlO 4 ^ ) over 100 random replicates. The corrresponding standard errors for these averages appear in 
parentheses. Tuning constants were chosen to minimize error over a larger independent dataset with 20,000 observations. The results of 

L1MSVM and OVR are taken from the paper of Wang and Shen (2007) 



VDAle, VDAl and VDAe L1MSVM OVR 



k d 


Bayes (%) 


Error (%) 




# Var 


Time 


Error (%) 


# Var 


Error (%) 


# Var 


4 1 


36.42 


43.19 (0.09) 


2.80 


(0.11) 2,2,4 


80 (3) 


42.20 (0.09) 


2.20 (0.05) 


56.87 (0.25) 


67.17 (1.93) 






44.95 (0.22) 


8.16 


(0.84) 2,5,16 


73 (3) 














43.27 (0.08) 


2.42 


(0.05) 2,2,3 


114 (3) 










2 


14.47 


15.31 (0.03) 


2.07 


(0.02) 2,2,2 


115 (4) 


15.18 (0.04) 


2.06 (0.02) 


16.21 (0.09) 


5.72 (0.38) 






16.22 (0.11) 


3.79 


(0.28) 2,3,8 


112 (3) 














15.54 (0.04) 


2.13 


(0.04) 2,2,3 


139 (3) 










3 


3.33 


3.40 (0.01) 


2(0) 


2,2,2 


182 (13) 


3.35 (0.02) 


2.02 (0.01) 


3.50 (0.02) 


2.51 (0.13) 






3.80 (0.04) 


3.18 


(0.18) 2,2,5 


145 (7) 














3.52 (0.01) 


2.12 


(0.04) 2,2,2 


197 (8) 










8 1 


64.85 


70.94 (0.11) 


2.43 


(0.08) 2,2,4 


312 (10) 


70.47 (0.10) 


3.51 (0.16) 


79.76 (0.07) 


98.18 (0.29) 






74.77 (0.09) 


23.19 


(1.99) 2,18,51 


278 (6) 














70.81 (0.10) 


2.57 


(0.09) 2,2,4 


387 (3) 










2 


43.82 


51.09 (0.24) 


2.27 


(0.06) 2,2,3 


351 (10) 


46.86 (0.12) 


3.02 (0.12) 


66.72 (0.11) 


95.43 (0.25) 






58.37 (0.11) 


33.34 


(1.48) 15,32,52 


269 (6) 














50.50 (0.22) 


2.17 


(0.05) 2,2,3 


355 (9) 










3 


25.06 


37.93 (0.40) 


2.23 


(0.05) 2,2,3 


436 (9) 


27.95 (0.13) 


2.75 (0.17) 


55.84 (0.12) 


93.37 (0.21) 






46.91 (0.15) 


33.88 


(1.30) 17,32,50 


264 (5) 














33.26 (0.36) 


2.02 


(0.01) 2,2,2 


462 (4) 
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Table 2 

Comparison of VDAle, VDAl, VDAe, L1MSVM, and L2MSVM on simulation example 
2 with k = 3 and n — 60. Column 2 lists the Bayes error as a percentage. Column 4 
reports the 10%, 50%, and 90% percentiles of the number of nonzero variables. The 
remaining columns report average testing error, average number of nonzero variables, 
and average training time in seconds fxlO 4 ) over 100 random replicates. The 
corresponding standard errors for these averages appear in parentheses. The partial 
results for L1MSVM and L2MSVM are taken from the paper of Wang and Shen (2007) 



p 


Bayes (%) 


VDAle, VDA L and VDA E 




L1MSVM 


L2MSVM 


Error (%) 


# Var 


Time 


Error (%) 


Error (%) 


10 


10.81 


12.38 (0.10) 


2, 3, 4 


71 


(8) 


13.61 (0.12) 


15.44 (0.17) 






14.42 (0.14) 


2,3, 10 


50 


(8) 










12.70 (0.12) 


2, 3, 5 


74 


(8) 






20 


10.81 


12.65 (0.11) 


2, 4, 6 


104 


(7) 


14.06 (0.14) 


17.81 (0.22) 






15.38 (0.19) 


2, 4, 20 


43 


(7) 










13.08 (0.13) 


3, 5, 7 


130 


(7) 






40 


10.81 


13.01 (0.13) 


3, 5, 9 


178 


(10) 


14.94 (0.14) 


20.01 (0.22) 






15.66 (0.20) 


3, 5, 28 


56 


(7) 










13.50 (0.13) 


4, 7, 10 


247 


(8) 






80 


10.81 


13.33 (0.14) 


5,8, 13 


345 


(15) 


15.68 (0.15) 


21.81 (0.14) 






16.15 (0.22) 


4, 8, 32 


89 


(8) 










13.99 (0.15) 


8, 12, 17 


440 


(14) 






160 


10.81 


14.02 (0.14) 


3, 14, 19 


647 


(30) 


16.58 (0.17) 


27.54 (0.17) 






17.12 (0.23) 


6, 12, 51 


180 


(8) 










15.08 (0.19) 


14, 19, 26 


830 


(22) 







In one of the example 2 simulations, we applied stability selection [Mein- 
shausen and Buehlmann (2010)] to eliminate false positives. The left panel 
of Figure 3 shows that the true predictors X\ and X2 have much higher 
selection probabilities than the irrelevant predictors. Here we take p = 160 
predictors and 100 subsamples, fix A^ at 0.1, and vary A^. The right panel 
of Figure 3 plots the average number of selected variables. One can control 
the number of false positives by choosing the cutoff n. Higher values of it 
reduce both the number of false positives and the number of true positives. 
Here an excellent balance is struck for A^ between 0.1 and 0.2. 

6.3. Simulation examples 3 through 6. To better assess the accuracy of 
the three new VDA methods, we now present four three-class examples. In 
each example we generated 1000 predictors on 200 training observations and 
1000 testing observations. Unless stated to the contrary, all predictors were 
independent and normally distributed with mean and variance 1. Penalty 
tuning constants were chosen by minimizing prediction error on the testing 
data. We report average results from 50 random samples. 
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Selection Probabilities Average Number of Selected Variables 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 



Fig. 3. Stability selection with VDAle for p= 160. The left panel shows the empirical 
selection probabilities of all 160 predictors over 100 subsamples as a function of Xl for 
Xe fixed at 0.1. The first two predictors (red solid lines) stand out from the remaining 
predictors (black dash lines) with much higher selection probabilities. The right panel plots 
the average number of selected predictors (black solid line) and the expected number of 
falsely selected predictors for different values of the cutoff n. 



Example 3. This is a multi-logit model with odds ratios 

Prfclass = l\x) \ ~ Xil ~ Xi2 ~ Xi3 + Xi7 + Xi8 ' for class 1, 

lo S TTT1 ^TT = { Xi4 + Xi5 + Xi6 - x i7 - x i8 , for class 2, 

rrfclass = Six) -, c , Q 

v 1 ' K 1, tor class 6. 

These ratios and the constraint Yli=i P r (class = i) = 1 determine class as- 
signment. Obviously, only the first eight predictors are relevant to classifi- 
cation. 



Example 4. In this example observations are equally distributed over 
classes. For j < 5 the predictor Xij has mean a,j with 

( (0.5,0.5,1,0,0), for class 1, 

(ai, aii 0,3, 04, 05) = < (—0.5,-0.5,0,1,0), for class 2, 
[(0.5,-0.5,0,0,1), for class 3. 

The remaining predictors have mean and are irrelevant to classification. 

Example 5. This example is the same as example 3 except that the 
first six predictors are correlated with correlation coefficient p = 0.8. 

Example 6. This example is the same as example 4 except that the 
first six predictors are correlated with correlation coefficient p = 0.8. 
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Table 3 

Comparison of VDAle, VDAl, and VDAe- Each line reports for 50 random replicates 
average prediction error and 10%, 50%, and 90% percentiles of the number of nonzero 
variables and nonzero true variables selected. Standard errors for the average prediction 

errors appear in parentheses 



Method Error (%) # Var # True Var Error (%) 



# Var # True Var 



VDAle 

VDA l 

VDAe 



Example 3 



36.13 (0.36) 
37.57 (0.34) 
37.27 (0.38) 



17,58,219 
20,87,264 
13,28,65 

Example 5 



7,8,8 
7,8,8 
6,8,8 



VDAle 24.19 (0.27) 8,14,40 6,7,8 
VDA L 25.85 (0.29) 6,30,63 4,6,8 
VDAe 24.11 (0.29) 11,19,39 6,7,8 



Example 4 



31.65 (0.31) 
34.05 (0.31) 
32.11 (0.33) 



5, 11, 64 
8, 76, 214 
5, 8, 21 

Example 6 



5,5,5 
5,5,5 
5,5,5 



6.98 (0.19) 6,11,24 5,5,5 
10.78 (0.32) 5, 12, 37 5,5,5 
6.64 (0.19) 7, 19, 43 5,5,5 



Table 3 summarizes classification results for these examples. In all in- 
stances VDAle and VDAe show lower prediction error rates than VDAl- 
In examples 3 and 4, where predictors are independent, VDAle an d VDAl 
have much higher false positive rates than VDAe ■ In defense of VDAle , it 
has a lower prediction error and a higher true positive rate than VDAe in 
example 3. In examples 5 and 6, where predictors are correlated, VDAle 
and VDAe have much lower prediction errors than VDAl; they also tend 
to better VDAl in variable selection. 

7. Real data examples. 

7.1. Overdetermined problems. To test the performance of VDA mod- 
els on real data, we first analyzed four standard datasets (wine, glass, zoo, 
and lymphography) from the UCI machine learning repository [Murphy and 
Aha (1994)]. Table 4 compares the performance of the modified VDAs to 
the original VDAr, linear discriminant analysis (LDA), quadratic discrim- 
inant analysis (QDA), the A;-nearest-neighbor method (KNN), one-versus- 
rest binary support vector machines (OVR), classification and regression 
trees (CART), random forest prediction, and multicategory support vector 
machines (MSVM) [Lee, Lin and Wahba (2004)]. For all four datasets, the 
error rates in the table are average misclassification rates based on 10-fold 
cross-validation. We chose the penalty tuning constants for the various VDA 
methods to minimize cross-validated errors over a one- or two-dimensional 
grid. The entries of Table 4 demonstrate the effectiveness of VDAr on small- 
scale problems. Our more complicated method VDAle is a viable contender. 
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Table 4 

Mean 10-fold cross-validated testing error rates for empirical examples from the UCI 
data repository. The triples beneath each dataset give in order the number of classes k, 
the number of cases n, and the number of predictors p. NA stands for not available 





Wine 


Glass 


Zoo 


Lymphography 


Method 


(3, 178, 13) 


(6, 214, 10) 


(7, 101, 16) 


(4, 148, 18) 


VDAr 





0.2970 


0.0182 


0.0810 


VDAle 


0.0055 


0.3267 


0.0091 


0.1210 


VDA l 


0.0111 


0.3357 


0.0272 


0.1277 


VDA e 


0.0111 


0.3420 


0.0182 


0.1620 


LDA 


0.0112 


0.3972 


NA 


0.1486 


QDA 


0.0169 


NA 


NA 


NA 


KNN (k = 2) 


0.2697 


0.3084 


0.0594 


0.2432 


OVR 


0.0225 


0.3178 


0.0891 


0.1486 


CART 


0.0899 


0.4346 


0.2475 


0.2095 


Random forest 


0.0169 


0.2009 


0.0693 


0.1621 


MSVM 


0.0169 


0.3645 


NA 


NA 



7.2. Under determined problems. Our final examples are benchmark 
datasets for cancer diagnosis. These public domain datasets are character- 
ized by large numbers of predictors and include the cancers: colon [Alon 
et al. (1999)], leukemia [Golub et al. (1999)], prostate [Singh et al. (2002)], 
brain [Pomeroy et al. (2000)], lymphoma [Alizadeh et al. (2000)], and SR- 
BCT [Khan et al. (2001)]. We compare our classification results with those 
from three other studies [Li, Zhang and Jiang (2005); Statnikov et al. (2005); 
Dettling (2004)]. Table 5 summarizes all findings. The cited error rates for 
BagBoost [Dettling (2004)], Boosting [Dettling and Buhlmann (2003)], Ran- 
For, SVM, nearest shrunken centroids (PAM) [Tibshirani et al. (2002)], di- 
agonal linear discriminant analysis (DLDA) [Tibshirani et al. (2002)], and 
KNN appear in [Dettling (2004)]. The error rates in Table 5 are average 
misclassification rates based on 3-fold cross-validation. Again we chose the 
penalty tuning constants for the various versions of VDA by grid optimiza- 
tion. The error rates and training times listed in Table 5 are predicated on 
the selected tuning constants. 

Inspection of Table 5 suggests that VDAle may be superior to the popular 
classifiers listed. Although very fast, VDAl is not competitive with VDAle; 
VDAe performs well but falters on the lymphoma and brain examples. Ow- 
ing to the large number of predictors, application of VDAr is impractical in 
these examples. We also applied stability selection to the leukemia and SR- 
BCT data. As Figure 4 demonstrates, the expected number of false positives 
is small across a range of cutoff values ir. 
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Table 5 

Threefold cross-validated testing errors (as percentages) for six benchmark cancer 
datasets. The parenthesized triples for each dataset give in order the number of categories 
k, the number of cases n, and the number of predictors p. Column 2 and subsequent 
columns report average testing error (standard error in parentheses), 10%, 50%, and 
90% percentiles of number of nonzero variables, and the average training time in seconds 
over 50 random partitions. Execution times apply to the entire dataset under the optimal 
tuning parameters determined by cross-validation. All results for the non-VDA methods 
are taken from the paper of Dettlmg (2004 ) 



Method 


Error (%) # Var Time Error (%) 


# Var Time Error (%) 


# Var Time 




Leukemia (2,72,3571) 


Colon (2,62,2000) 


Prostate (2, 102, 6033) 


V JJAle 


1.56 18, 39, 74 0.50 


9.68 


10, 27, 103 0.15 


5.48 


16, 40, 53 1.15 




(0.15) 


(0.55) 




(0.33) 




vn a v 


7.14 26, 30, 85 0.08 


14.26 


19, 25, 147 0.04 


9.83 


30, 36, 200 0.23 




(0.62) 


(0.65) 




(0.56) 




vn A „ 


3.02 42, 54, 179 0.45 


11.08 


34, 42, 213 0.12 


6.76 


47, 57, 366 0.85 




(0.28) 


(0.52) 




(0.41) 




BagBoost 


4.08 


16.10 




7.53 




Boosting 


5.67 


19.14 




8.71 




RanFor 


1.92 


14.86 




9.00 




SVM 


1.83 


15.05 




7.88 




PAM 


3.75 


11.90 




16.53 




DLDA 


2.92 


12.86 




14.18 




KNN 


3.83 


16.38 




10.59 






Lymphoma (3,62,4026) 


SRBCT (4,63,2308) 


Brain (5, 42, 5597) 


VDAle 


1.66 39, 69, 97 1.47 


1.58 


45, 60, 94 1.78 


23.80 


52, 78, 98 4.39 




(0.27) 


(0.77) 




(1.54) 




VDA l 


14.36 39, 53, 86 0.12 


9.52 


43, 53, 65 0.11 


48.86 


46, 57, 66 0.38 




(0.97) 


(1.14) 




(1.43) 




VDA e 


3.25 80, 91, 128 2.01 


1.58 


58, 70, 106 1.70 


30.44 


70, 85, 100 6.43 




(0.38) 


(0.92) 




(1.76) 




BagBoost 


1.62 


1.24 




23.86 




Boosting 


6.29 


6.19 




27.57 




RanFor 


1.24 


3.71 




33.71 




SVM 


1.62 


2.00 




28.29 




PAM 


5.33 


2.10 




25.29 




DLDA 


2.19 


2.19 




28.57 




KNN 


1.52 


1.43 




29.71 





8. Discussion. As one of the most important branches of applied statis- 
tics, discriminant analysis continues to attract the attention of theoreticians. 
Although the flux of new statistical demands and ideas has not produced 
a clear winner among the competing methods, we hope to have convinced 
readers that VDA and its various modifications are competitive. It is easy to 
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Fig. 4. Stability selection with VTMle for leukemia (upper panel with k = 2) and SRBCT 
(lower panel with k — 3) data. The left panels plot the empirical selection probabilities of 
all predictors over 100 subsamples as a function of Xl for Xe fixed at 0.1. The right panel 
plots the average number of selected predictors (black solid line) and the expected number 
of falsely selected predictors for different values of the cutoff n. 



summarize the virtues of VDA in four words: parsimony, robustness, speed, 
and symmetry. VDAr excels in robustness and symmetry but falls behind in 
parsimony and speed. We recommend it highly for problems with a handful 
of predictors. VDAe excels in robustness, speed, and symmetry. On high- 
dimensional problems it does not perform quite as well as VDAlei which 
sacrifices a little symmetry for extra parsimony. Apparently, VDAl puts too 
high a premium on parsimony at the expense of symmetry. 

Our Euclidean penalties tie together the parameters corresponding to 
a single predictor. Some applications may require novel ways of grouping 
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predictors. For example in cancer diagnosis, genes in the same biological 
pathway could be grouped. If reliable grouping information is available, then 
one should contemplate adding further Euclidean penalties [Wu and Lange 
(2008)]. If other kinds of structures exist, one should opt for different penalty 
functions. For example, Yuan, Joseph and Zou (2009) and Wu, Zou and 
Yuan (2008) discuss the problem of how to retain hierarchical structures in 
variable selection using the nonnegative garrote [Breiman (1995)]. 

The class vertices in VDA are symmetrically distributed over the surface 
of the unit ball. When categories are ordered or partially ordered, equidis- 
tant vertices may not be optimal. The question of how to incorporate order 
constraints deserves further investigation. The simplest device for three or- 
dered categories is to identify them with the points —1, 0, and 1 on the 
line. 

Future applications of discriminant analysis will confront even larger 
datasets. Computing times are apt to balloon out of control unless the right 
methods are employed. Cyclic coordinate descent has proved to be extraor- 
dinarily fast when coupled with lasso or Euclidean penalties. The same speed 
advantages are seen in lasso penalized regression and generalized linear mod- 
els. Further gains in speed may well come in parallel computing. Statisti- 
cians have been slow to plunge into parallel computing because of the extra 
programming effort required and the lack of portability across computing 
platforms. It is not clear how best to exploit parallel computing with VDA. 

Stability selection as sketched by Meinshausen and Buehlmann (2010) 
appears to work well with VDA. In our simulated example, it eliminates 
virtually all irrelevant predictors while retaining the true predictors. For the 
cancer data, the true predictors are unknown; it is encouraging that the 
expected number of false positives is very low. Because stability selection 
requires repeated subsampling of the data, users will pay a computational 
price. This cost is not excessive for VDA, and we highly recommend stability 
selection. In our view it will almost certainly replace cross-validation in 
model selection. 

The theoretical underpinnings of VDA and many other methods of dis- 
criminant analysis are weak. We prove Fisher consistency here, but more 
needs to be done. For instance, it would be reassuring if someone could vin- 
dicate our intuition that shrinkage is largely irrelevant to classification by 
VDA. Although it is probably inevitable that statistical practice will outrun 
statistical theory in discriminant analysis, ultimately there is no stronger 
tether to reality than a good theory. Of course, a bad or irrelevant theory is 
a waste of time. 

SUPPLEMENTARY MATERIAL 

Supplementary File: Proof of Proposition 1 (DOI: 10.1214/10-AOAS345SUPP; 
.pdf). We prove Fisher consistency of e-insensitive loss in this paper. 



24 



T. T. WU AND K. LANGE 



REFERENCES 

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C, Lossos, I. S., Rosenwald, A., 
Boldrick, J. C, Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. 

E. , Moore, T., Hudson, J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., 
Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, 
R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. 
O. and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified 
by gene expression profiling. Nature 403 503-511. 

Alon, U., Barkai, N., Notterman, D., Gish, K., Mack, S. and Levine, J. (1999). 
Broad patterns of gene expression revealed by clustering analysis of tumor and normal 
colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 96 6745- 
6750. 

Bredensteiner, E. and Bennett, K. (1999). Multicategory classification by support 
vector machines. Comput. Optim. Appl. 12 53-79. MR1704101 

Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technomet- 
rics 37 373-384. MR1365720 

Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass 
kernel-based vector machines. J. Mach. Learn. Res. 2 265-292. 

Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. 
Bioinformatics 20 3583-3593. 

Dettling, M. and Buhlmann, P. (2003). Boosting for tumor classification with gene 
expression data. Bioinformatics 19 1061-1069. MR2498230 

Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination 
methods for the classification of tumors using gene expression data. Amer. Statist. 
Assoc. 97 77-87. MR1963389 

Friedman, J., Hastie, T., Hoefling, H. and Tibshirani, R. (2007). Pathwise coordi- 
nate optimization. Ann. Appl. Statist. 2 302-332. MR2415737 

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C, Gaasenbeek, M., Mesirov, 
J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, 
C. D. and Lander, E. S. (1999). Molecular classification of cancer: Class discovery 
and class prediction by gene expression monitoring. Science 286 531-537. 

Guermeur, Y. (2002). Combining discriminant models with new multiclass SVMS. Pat- 
tern Analysis & Applications 5 168-179. MR1922450 

Khan, J., Weil, J. S., Markus Ringnr, M., Saal, L. H., Ladanyi, M., Westermann, 

F. , Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, 
P. S. (2001). Classification and diagnostic prediction of cancers using gene expression 
profiling and artificial neural networks. Nat. Med. 7 673-679. 

Kressel, U. (1999). Pairwise classification and support vector machines. In Advances in 
Kernel Methods: Support Vector Learning, Chap. 15. MIT Press, Cambridge, MA. 

Lange, K. and Wu, T. T. (2008). An MM algorithm for multicategory vertex discrimi- 
nant analysis. J. Comput. Graph. Statist. 17 527-544. MR2528236 

Lee, Y., Lin, Y. and Wahba, G. (2004). Multicategory support vector machines: Theory 
and application to the classification of microarray data and satellite radiance data. J. 
Amer. Statist. Assoc. 99 67-81. MR2054287 

Li, H., Zhang, K. and Jiang, T. (2005). Robust and accurate cancer classification with 
gene expression profiling. In Proc. LEEE Comput. Syst. Bioinform. Conf.: 8-11 August 
2005 310-321. California. 

Li, T., Zhang, C. and Ogihara, M. (2004). A comparative study of feature selection 
and multiclass classification methods for tissue classification based on gene expression. 
Bioinformatics 20 2429-2437. 



MULTICATEGORY VDA FOR HIGH-DIMENSIONAL DATA 



25 



Liu, Y. (2007). Fisher consistency of multicategory support vector machines. Eleventh 

International Conference on Artificial Intelligence and Statistics 2 289-296. 
Liu, Y., Shen, X. and DOSS, H. (2005). Multicategory ^-learning and support vector 

machine: Computational tools. J. Comput. Graph. Statist. 14 219-236. MR2137899 
Liu, Y., Shen, X. and Doss, H. (2006). Multicategory -^-learning. J. Amer. Statist. 

Assoc. 101 500-509. MR2256170 
Meinshausen, N. and Buehlmann, P. (2010). Stability selection (with discussion). J. 

Roy. Statist. Soc. Ser. B DOI: 10.1111/j. 1467-9868. 2010.00740.x. 
Murphy, P. M. and Aha, D. W. (1994). UCI Repository of machine learning databases. 

Available at http://www.ics.uci.edu/mlearn/~MLRepository.html. 
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., 

McLaughlin, M. E., Kim, J. Y. H., Goumnerova, L. C., Black, P. M., Lau, 

C., Allen, J. C., Zagzag, D., Olson, J. M., Curran, T., Wetmore, C., Biegel, 

J. A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., 

Louis, D. N., Mesirov, J. P., Lander, E. S. and Golub, T. R. (2000). Molecular 

portraits of human breast tumours. Nature 406 747-752. 
Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., 

Renshaw, A., D'Amico, A. and Richie, J. (2002). Gene expression correlates of 

clinical prostate cancer behavior. Cancer Cell 1 203-209. 
Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D. and Levy, S. (2005). A 

comprehensive evaluation of multicategory classification methods for microarray gene 

expression cancer diagnosis. Bioinformatics 21 631-643. 
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple 

cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99 

6567-6572. 

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York. MR1641250 

Wang, L. and Shen, X. (2007). On Li-norm multiclass support vector machines. J. Amer. 

Statist. Assoc. 583-594. MR2370855 
Weston, J. and Watkins, C. (1999). Support vector machines for multi-class pattern 

recognition. In Proceedings of the Seventh European Symposium on Artificial Neural 

Networks 219-224. 

Wu, S., Zou, H. and Yuan, M. (2008). Structured variable selection in support vector 
machines. Elec. J. Stat. 2 103-117. MR2386088 

Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized 
regression. Ann. Appl. Statist. 2 224-244. MR2415601 

Wu, T. T. and Lange, K. (2010). Supplement to "Multicategory vertex discriminant 
analysis for high-dimensional data." DOI: 10.1214/10-AOAS345SUPP. 

Yuan, M., Joseph, R. and Zou, H. (2009). Structured variable selection and estimation. 
Ann. Appl. Statist. 3 1738-1757. 

Zhang, T. (2004a). Statistical analysis of some multi-category large margin classification 
methods. J. Mach. Learn. Res. 5 1225-1251. MR2248016 

Zhang, T. (2004b). Statistical behavior and consistency of classification methods based 
on convex risk minimization. Ann. Statist. 32 469-475. MR2051001 

Zhang, H. H., Liu, Y., Wu, Y. and Zhu, J. (2008). Variable selection for the multicat- 
egory SVM via adaptive sup-norm regularization. Elec. J. Stat. 2 149-167. MR2386091 

Zou, H., Zhu, J. and Hastie, T. (2006). The margin vector, admissible loss, and multi- 
class margin-based classifiers. Technical report, Dept. Statistics, Stanford Univ. 

Zou, H., Zhu, J. and Hastie, T. (2008). New multicategory boosting algorithms based 
on multicategory Fisher-consistent losses. Ann. Appl. Statist. 2 1290-1306. 



26 



T. T. WU AND K. LANGE 



Department of Epidemiology 

AND BlOSTATISTICS 

University of Maryland 
1242C SPH Building 
College Park, Maryland 20707 
USA 

E-MAIL: ttwu@umd.cdu 



Departments of Biomathematics 

and Human Genetics 
Geffen School of Medicine at UCLA 
Gonda Research Center 
695 Charles E. Young Drive South 
Los Angeles, California 90095-7088 
USA 

E-MAIL: klange@ucla.edu 



